Poster: Albretch Date: May 15, 2022 9:30pm

Forum: texts Subject: "text cleansing" / removing brownish background from pdf files ...

Which techniques are used to remove such background coloring? Take a look at this file:
https://archive.org/download/historyofmateria03languoft/historyofmateria03languoft.pdf
archive.org has taken good care of archiving lots of data, but most (all?) texts available here are not usable as text. Most of them are not readily usable for corpus research.
In this era of "archivism", archiving a text should mean more than just saving it to be read by someone else at some other time.
The "visually pleasing" aspect people find in pdf files is based on layers of formatting aberrations.
I think the quality of the texts can be enhanced greatly by streamlining some functionality based on run-of-the-mill open source software and some eyeballing of certain targeted segments of text by a determined community (like the pgdp.net kind of folks). The texts which need care are in the public domain anyway.
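To make the background question concrete, here is a minimal sketch of one common technique, global thresholding with Otsu's method, which picks the gray level that best separates dark ink from lighter paper. It is only an illustration under assumptions: the function names (`to_gray`, `otsu_threshold`, `clean_page`) are made up for this post, a page is assumed to be already decoded into rows of (r, g, b) tuples, and nothing here claims to be what archive.org actually runs. A real pipeline would use an image library to extract and decode the page images from the pdf first.

```python
def to_gray(rgb):
    """Convert an (r, g, b) tuple to an 8-bit luminance value."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def otsu_threshold(gray_values):
    """Return the gray level t that maximizes between-class variance.

    Pixels <= t form the dark class (ink), pixels > t the light class (paper).
    """
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue          # no dark pixels yet
        n_light = total - n_dark
        if n_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def clean_page(pixels):
    """Map a page (rows of (r, g, b) tuples) to pure black (0) / white (255)."""
    gray = [[to_gray(p) for p in row] for row in pixels]
    t = otsu_threshold(v for row in gray for v in row)
    return [[0 if v <= t else 255 for v in row] for row in gray]


if __name__ == "__main__":
    paper = (200, 170, 120)   # brownish background, an assumed sample color
    ink = (40, 30, 20)        # dark text, also an assumed sample color
    page = [[paper, ink, paper],
            [ink, paper, ink]]
    for row in clean_page(page):
        print(row)
```

A single global threshold tends to work when the page is evenly lit; scans with shadows near the binding or heavy foxing usually need a locally adaptive threshold instead, which is the kind of case where some eyeballing by volunteers would still help.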
Where do folks interested in "text cleansing" hang out? A google search on: site:https://archive.org/iathreads "text cleansing" gave me 5 unhinged results and another attempt at: site:https://archive.org "text cleansing" gave me nothing.
I think all text should be available in a format using an open specification such as ODF (which is also very easily translatable to any other format, including pdf). There should also be provisions for plain texts with encoded media specified in some well-defined way.
Something very important that archive.org should work on before even starting such a cleansing project is a general, well-defined, fluent form of text formatting, from which all kinds of folks would benefit.
I would propose to start such a project with like-minded individuals.
lbrtchx
